LEAD SCORE ANALYSIS¶

Case Study - Summary¶

1. Company:¶

  • X Education (sells online courses to professionals).

2. Current Situation:¶

  • Many leads generated daily (via website visits, ads, social media, referrals).
  • Only ~30% of leads convert to paying customers.
  • Sales team spends excessive time contacting all leads, regardless of conversion likelihood.

3. Business Goal:¶

  • Identify “Hot Leads” (those most likely to convert).
  • Build a logistic regression model to assign each lead a score between 0–100.
  • Higher score → higher probability of conversion.
  • CEO aims to raise conversion rate to ~80% by focusing sales efforts on high-scoring leads.

4. Data Given:¶

  • ~9000 historical lead records.
  • Features include Lead Source, Total Time Spent on Website, Total Visits, Last Activity, etc.
  • Target variable: Converted (1 = converted, 0 = not converted).
  • Categorical variables contain a “Select” level → treat as missing/null.

5. Expected Results:¶

  • A Jupyter notebook with:
    • Logistic regression model.
    • Lead score predictions.
    • Model evaluation metrics (accuracy, precision, recall, ROC-AUC).
  • Business insights:
    • (a) Top 3 most influential variables overall.
    • (b) Top 3 categorical/dummy variables impacting conversion.
    • (c) Strategy when interns join → aggressive outreach to medium-score leads.
    • (d) Strategy when target is met → reduce outreach to low-score leads to save resources.
In [ ]:
 
In [1]:
# Import necessary libraries 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings 
warnings.filterwarnings('ignore')
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder,StandardScaler,PowerTransformer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn import metrics
from sklearn.metrics import precision_recall_curve

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

Step-1: Reading the Data¶

In [2]:
# Load and Read the dataset
df = pd.read_csv("Leads.csv")
df.head()
Out[2]:
Prospect ID Lead Number Lead Origin Lead Source Do Not Email Do Not Call Converted TotalVisits Total Time Spent on Website Page Views Per Visit ... Get updates on DM Content Lead Profile City Asymmetrique Activity Index Asymmetrique Profile Index Asymmetrique Activity Score Asymmetrique Profile Score I agree to pay the amount through cheque A free copy of Mastering The Interview Last Notable Activity
0 7927b2df-8bba-4d29-b9a2-b6e0beafe620 660737 API Olark Chat No No 0 0.0 0 0.0 ... No Select Select 02.Medium 02.Medium 15.0 15.0 No No Modified
1 2a272436-5132-4136-86fa-dcc88c88f482 660728 API Organic Search No No 0 5.0 674 2.5 ... No Select Select 02.Medium 02.Medium 15.0 15.0 No No Email Opened
2 8cc8c611-a219-4f35-ad23-fdfd2656bd8a 660727 Landing Page Submission Direct Traffic No No 1 2.0 1532 2.0 ... No Potential Lead Mumbai 02.Medium 01.High 14.0 20.0 No Yes Email Opened
3 0cc2df48-7cf4-4e39-9de9-19797f9b38cc 660719 Landing Page Submission Direct Traffic No No 0 1.0 305 1.0 ... No Select Mumbai 02.Medium 01.High 13.0 17.0 No No Modified
4 3256f628-e534-4826-9d63-4a8b88782852 660681 Landing Page Submission Google No No 1 2.0 1428 1.0 ... No Select Mumbai 02.Medium 01.High 15.0 18.0 No No Modified

5 rows × 37 columns

In [3]:
# check the shape
df.shape
Out[3]:
(9240, 37)
In [4]:
df_copy = df.copy()

Step-2: Understanding Data¶

In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 37 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   Prospect ID                                    9240 non-null   object 
 1   Lead Number                                    9240 non-null   int64  
 2   Lead Origin                                    9240 non-null   object 
 3   Lead Source                                    9204 non-null   object 
 4   Do Not Email                                   9240 non-null   object 
 5   Do Not Call                                    9240 non-null   object 
 6   Converted                                      9240 non-null   int64  
 7   TotalVisits                                    9103 non-null   float64
 8   Total Time Spent on Website                    9240 non-null   int64  
 9   Page Views Per Visit                           9103 non-null   float64
 10  Last Activity                                  9137 non-null   object 
 11  Country                                        6779 non-null   object 
 12  Specialization                                 7802 non-null   object 
 13  How did you hear about X Education             7033 non-null   object 
 14  What is your current occupation                6550 non-null   object 
 15  What matters most to you in choosing a course  6531 non-null   object 
 16  Search                                         9240 non-null   object 
 17  Magazine                                       9240 non-null   object 
 18  Newspaper Article                              9240 non-null   object 
 19  X Education Forums                             9240 non-null   object 
 20  Newspaper                                      9240 non-null   object 
 21  Digital Advertisement                          9240 non-null   object 
 22  Through Recommendations                        9240 non-null   object 
 23  Receive More Updates About Our Courses         9240 non-null   object 
 24  Tags                                           5887 non-null   object 
 25  Lead Quality                                   4473 non-null   object 
 26  Update me on Supply Chain Content              9240 non-null   object 
 27  Get updates on DM Content                      9240 non-null   object 
 28  Lead Profile                                   6531 non-null   object 
 29  City                                           7820 non-null   object 
 30  Asymmetrique Activity Index                    5022 non-null   object 
 31  Asymmetrique Profile Index                     5022 non-null   object 
 32  Asymmetrique Activity Score                    5022 non-null   float64
 33  Asymmetrique Profile Score                     5022 non-null   float64
 34  I agree to pay the amount through cheque       9240 non-null   object 
 35  A free copy of Mastering The Interview         9240 non-null   object 
 36  Last Notable Activity                          9240 non-null   object 
dtypes: float64(4), int64(3), object(30)
memory usage: 2.6+ MB
In [6]:
df.describe()
Out[6]:
Lead Number Converted TotalVisits Total Time Spent on Website Page Views Per Visit Asymmetrique Activity Score Asymmetrique Profile Score
count 9240.000000 9240.000000 9103.000000 9240.000000 9103.000000 5022.000000 5022.000000
mean 617188.435606 0.385390 3.445238 487.698268 2.362820 14.306252 16.344883
std 23405.995698 0.486714 4.854853 548.021466 2.161418 1.386694 1.811395
min 579533.000000 0.000000 0.000000 0.000000 0.000000 7.000000 11.000000
25% 596484.500000 0.000000 1.000000 12.000000 1.000000 14.000000 15.000000
50% 615479.000000 0.000000 3.000000 248.000000 2.000000 14.000000 16.000000
75% 637387.250000 1.000000 5.000000 936.000000 3.000000 15.000000 18.000000
max 660737.000000 1.000000 251.000000 2272.000000 55.000000 18.000000 20.000000

Step-3: Data Preparation¶

1. Null Value Handling¶

In [7]:
# Remove duplicates

df.drop_duplicates(inplace=True) # inplace = True to make the changes permanent
In [8]:
# Check for the null values using isnull() function

df.isnull().sum()
Out[8]:
Prospect ID                                         0
Lead Number                                         0
Lead Origin                                         0
Lead Source                                        36
Do Not Email                                        0
Do Not Call                                         0
Converted                                           0
TotalVisits                                       137
Total Time Spent on Website                         0
Page Views Per Visit                              137
Last Activity                                     103
Country                                          2461
Specialization                                   1438
How did you hear about X Education               2207
What is your current occupation                  2690
What matters most to you in choosing a course    2709
Search                                              0
Magazine                                            0
Newspaper Article                                   0
X Education Forums                                  0
Newspaper                                           0
Digital Advertisement                               0
Through Recommendations                             0
Receive More Updates About Our Courses              0
Tags                                             3353
Lead Quality                                     4767
Update me on Supply Chain Content                   0
Get updates on DM Content                           0
Lead Profile                                     2709
City                                             1420
Asymmetrique Activity Index                      4218
Asymmetrique Profile Index                       4218
Asymmetrique Activity Score                      4218
Asymmetrique Profile Score                       4218
I agree to pay the amount through cheque            0
A free copy of Mastering The Interview              0
Last Notable Activity                               0
dtype: int64
In [9]:
# converting null values into percentages for better understanding
round((df.isnull().sum()/len(df)*100),2)
Out[9]:
Prospect ID                                       0.00
Lead Number                                       0.00
Lead Origin                                       0.00
Lead Source                                       0.39
Do Not Email                                      0.00
Do Not Call                                       0.00
Converted                                         0.00
TotalVisits                                       1.48
Total Time Spent on Website                       0.00
Page Views Per Visit                              1.48
Last Activity                                     1.11
Country                                          26.63
Specialization                                   15.56
How did you hear about X Education               23.89
What is your current occupation                  29.11
What matters most to you in choosing a course    29.32
Search                                            0.00
Magazine                                          0.00
Newspaper Article                                 0.00
X Education Forums                                0.00
Newspaper                                         0.00
Digital Advertisement                             0.00
Through Recommendations                           0.00
Receive More Updates About Our Courses            0.00
Tags                                             36.29
Lead Quality                                     51.59
Update me on Supply Chain Content                 0.00
Get updates on DM Content                         0.00
Lead Profile                                     29.32
City                                             15.37
Asymmetrique Activity Index                      45.65
Asymmetrique Profile Index                       45.65
Asymmetrique Activity Score                      45.65
Asymmetrique Profile Score                       45.65
I agree to pay the amount through cheque          0.00
A free copy of Mastering The Interview            0.00
Last Notable Activity                             0.00
dtype: float64
In [10]:
# Handle 'SELECT' or placeholder values ( treat them as NaN)

df.replace('Select',np.nan, inplace=True)
In [11]:
# Drop columns with too many missing values [missing values >40%]

missing_ratio = df.isnull().mean()
cols_to_drop = missing_ratio[missing_ratio > 0.4].index
df.drop(columns = cols_to_drop,inplace=True)
In [12]:
df.shape
Out[12]:
(9240, 30)
In [13]:
# Handling null values for categorical columns (replacing with mode )
for col in df.select_dtypes(include='object').columns :
    df[col].fillna(df[col].mode()[0],inplace = True)
In [14]:
# Handling null values for numerical columns ( replacing with the median )

for col in df.select_dtypes(include =['int64','float64']).columns:
    df[col].fillna(df[col].median(),inplace = True)
In [15]:
df.isnull().sum()
Out[15]:
Prospect ID                                      0
Lead Number                                      0
Lead Origin                                      0
Lead Source                                      0
Do Not Email                                     0
Do Not Call                                      0
Converted                                        0
TotalVisits                                      0
Total Time Spent on Website                      0
Page Views Per Visit                             0
Last Activity                                    0
Country                                          0
Specialization                                   0
What is your current occupation                  0
What matters most to you in choosing a course    0
Search                                           0
Magazine                                         0
Newspaper Article                                0
X Education Forums                               0
Newspaper                                        0
Digital Advertisement                            0
Through Recommendations                          0
Receive More Updates About Our Courses           0
Tags                                             0
Update me on Supply Chain Content                0
Get updates on DM Content                        0
City                                             0
I agree to pay the amount through cheque         0
A free copy of Mastering The Interview           0
Last Notable Activity                            0
dtype: int64
Hence the null values in the data are removed. Therefore there are no null values in the data¶

2. Outliers Handling¶

In [16]:
# check the outliers present in the data set by check the quantilies [.25, .5, .75, .90, .95, .99]
df.describe(percentiles=[.25, .5, .75, .90, .95, .99])
Out[16]:
Lead Number Converted TotalVisits Total Time Spent on Website Page Views Per Visit
count 9240.000000 9240.000000 9240.000000 9240.000000 9240.000000
mean 617188.435606 0.385390 3.438636 487.698268 2.357440
std 23405.995698 0.486714 4.819024 548.021466 2.145781
min 579533.000000 0.000000 0.000000 0.000000 0.000000
25% 596484.500000 0.000000 1.000000 12.000000 1.000000
50% 615479.000000 0.000000 3.000000 248.000000 2.000000
75% 637387.250000 1.000000 5.000000 936.000000 3.000000
90% 650506.100000 1.000000 7.000000 1380.000000 5.000000
95% 655404.050000 1.000000 10.000000 1562.000000 6.000000
99% 659592.980000 1.000000 17.000000 1840.610000 9.000000
max 660737.000000 1.000000 251.000000 2272.000000 55.000000
In [17]:
# Visualize the outliers in the numerical columns
num_col = df.select_dtypes(include=['int64','float64']).columns
for i in num_col:
    if i != 'Converted':
        plt.figure(figsize = (6,4))
        sns.boxplot(df[i])
        plt.show()
    
#sns.boxplot(df['Lead Number'])
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [18]:
# Checking for outliers and Handling them using Inter Quartile Range

num_col = df.select_dtypes(include=['int64','float64']).columns

for i in num_col:
    Q1 = df[i].quantile(0.25)
    Q3 = df[i].quantile(0.75)
    IQR = Q3-Q1
    lower_bound = Q1-1.5*IQR
    upper_bound = Q3+1.5*IQR
    df[i] = np.where(df[i] > upper_bound, upper_bound,
                      np.where(df[i] < lower_bound, lower_bound, df[i])) # Capping the outliers
In [19]:
df.shape
Out[19]:
(9240, 30)
In [20]:
# Visualize the outliers in the numerical columns after handling them
num_col = df.select_dtypes(include=['int64','float64']).columns
for i in num_col:
    if i != 'Converted':
        plt.figure(figsize = (6,4))
        sns.boxplot(df[i])
        plt.show()
    
print("Outliers of the numerical columns are handled")
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
Outliers of the numerical columns are handled

3. Identify numerical and categorical columns in the dataset¶

In [21]:
numerical_col = df.select_dtypes(include = ['int64','float64']).columns
categorical_col = df.select_dtypes(include = ['object']).columns
numerical_col
Out[21]:
Index(['Lead Number', 'Converted', 'TotalVisits',
       'Total Time Spent on Website', 'Page Views Per Visit'],
      dtype='object')
In [22]:
df.head()
Out[22]:
Prospect ID Lead Number Lead Origin Lead Source Do Not Email Do Not Call Converted TotalVisits Total Time Spent on Website Page Views Per Visit ... Digital Advertisement Through Recommendations Receive More Updates About Our Courses Tags Update me on Supply Chain Content Get updates on DM Content City I agree to pay the amount through cheque A free copy of Mastering The Interview Last Notable Activity
0 7927b2df-8bba-4d29-b9a2-b6e0beafe620 660737.0 API Olark Chat No No 0.0 0.0 0.0 0.0 ... No No No Interested in other courses No No Mumbai No No Modified
1 2a272436-5132-4136-86fa-dcc88c88f482 660728.0 API Organic Search No No 0.0 5.0 674.0 2.5 ... No No No Ringing No No Mumbai No No Email Opened
2 8cc8c611-a219-4f35-ad23-fdfd2656bd8a 660727.0 Landing Page Submission Direct Traffic No No 1.0 2.0 1532.0 2.0 ... No No No Will revert after reading the email No No Mumbai No Yes Email Opened
3 0cc2df48-7cf4-4e39-9de9-19797f9b38cc 660719.0 Landing Page Submission Direct Traffic No No 0.0 1.0 305.0 1.0 ... No No No Ringing No No Mumbai No No Modified
4 3256f628-e534-4826-9d63-4a8b88782852 660681.0 Landing Page Submission Google No No 1.0 2.0 1428.0 1.0 ... No No No Will revert after reading the email No No Mumbai No No Modified

5 rows × 30 columns

Converting some binary variables (Yes/No) to 0/1¶

In [23]:
variable_list = ['Do Not Email', 'Do Not Call', 'Search', 'Magazine', 'Newspaper Article', 
               'X Education Forums', 'Newspaper', 'Digital Advertisement', 
               'Through Recommendations', 'Receive More Updates About Our Courses',
               'I agree to pay the amount through cheque', 'A free copy of Mastering The Interview','Update me on Supply Chain Content','Get updates on DM Content']
In [24]:
#df[variable_list] = df[variable_list].apply(lambda x: x.map({'Yes': 1,'No': 0}))
# creating a function binary_map to convert yes/no columns to 0/1

def binary_map(x):
    return x.map({'Yes': 1, "No": 0})

# Applying the function to the housing list
df[variable_list] = df[variable_list].apply(binary_map)
In [25]:
df.head()
Out[25]:
Prospect ID Lead Number Lead Origin Lead Source Do Not Email Do Not Call Converted TotalVisits Total Time Spent on Website Page Views Per Visit ... Digital Advertisement Through Recommendations Receive More Updates About Our Courses Tags Update me on Supply Chain Content Get updates on DM Content City I agree to pay the amount through cheque A free copy of Mastering The Interview Last Notable Activity
0 7927b2df-8bba-4d29-b9a2-b6e0beafe620 660737.0 API Olark Chat 0 0 0.0 0.0 0.0 0.0 ... 0 0 0 Interested in other courses 0 0 Mumbai 0 0 Modified
1 2a272436-5132-4136-86fa-dcc88c88f482 660728.0 API Organic Search 0 0 0.0 5.0 674.0 2.5 ... 0 0 0 Ringing 0 0 Mumbai 0 0 Email Opened
2 8cc8c611-a219-4f35-ad23-fdfd2656bd8a 660727.0 Landing Page Submission Direct Traffic 0 0 1.0 2.0 1532.0 2.0 ... 0 0 0 Will revert after reading the email 0 0 Mumbai 0 1 Email Opened
3 0cc2df48-7cf4-4e39-9de9-19797f9b38cc 660719.0 Landing Page Submission Direct Traffic 0 0 0.0 1.0 305.0 1.0 ... 0 0 0 Ringing 0 0 Mumbai 0 0 Modified
4 3256f628-e534-4826-9d63-4a8b88782852 660681.0 Landing Page Submission Google 0 0 1.0 2.0 1428.0 1.0 ... 0 0 0 Will revert after reading the email 0 0 Mumbai 0 0 Modified

5 rows × 30 columns

In [26]:
# Remove the column 'Prospect ID' it is of no use to us
df.drop('Prospect ID',axis =1, inplace = True)

For categorical variables with multiple levels, create dummy features (one-hot encoded)¶

In [27]:
categorical_cols = df.select_dtypes(include='object').columns.tolist()
In [28]:
categorical_cols
Out[28]:
['Lead Origin',
 'Lead Source',
 'Last Activity',
 'Country',
 'Specialization',
 'What is your current occupation',
 'What matters most to you in choosing a course',
 'Tags',
 'City',
 'Last Notable Activity']
In [29]:
# create dummy variables for the categorical columns
dummy_variables = pd.get_dummies(df[categorical_cols], drop_first= True).astype(int)
In [30]:
# Merge the dummy variables with the original dataset
df = pd.concat([df,dummy_variables],axis=1)
df.shape
Out[30]:
(9240, 175)
In [31]:
# Drop the categorical_cols from the data as we have created dummy variables for them
df.drop(categorical_cols, axis =1,inplace = True)
In [32]:
df.shape
Out[32]:
(9240, 165)
In [33]:
df.head()
Out[33]:
Lead Number Do Not Email Do Not Call Converted TotalVisits Total Time Spent on Website Page Views Per Visit Search Magazine Newspaper Article ... Last Notable Activity_Form Submitted on Website Last Notable Activity_Had a Phone Conversation Last Notable Activity_Modified Last Notable Activity_Olark Chat Conversation Last Notable Activity_Page Visited on Website Last Notable Activity_Resubscribed to emails Last Notable Activity_SMS Sent Last Notable Activity_Unreachable Last Notable Activity_Unsubscribed Last Notable Activity_View in browser link Clicked
0 660737.0 0 0 0.0 0.0 0.0 0.0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
1 660728.0 0 0 0.0 5.0 674.0 2.5 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 660727.0 0 0 1.0 2.0 1532.0 2.0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 660719.0 0 0 0.0 1.0 305.0 1.0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
4 660681.0 0 0 1.0 2.0 1428.0 1.0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0

5 rows × 165 columns

Checking the conversion rate¶

In [34]:
lead_conversion_rate = (sum(df['Converted'])/len(df['Converted'].index))*100
lead_conversion_rate
Out[34]:
38.53896103896104

Step-4 : Looking at correlations¶

In [46]:
# Correlation matrix using heatmap
plt.figure(figsize=(60,40))
sns.heatmap(df.corr(),annot = True)
plt.show()
No description has been provided for this image
In [47]:
# Heatmap for numerical columns
plt.figure(figsize = (12,10))
sns.heatmap(df.select_dtypes(include=['int64','int32']).corr(), annot = True)
plt.show()
No description has been provided for this image
In [48]:
# heatmap for only float value columns
plt.figure(figsize = (12,10))
sns.heatmap(df.select_dtypes(include=['float']).corr(), annot = True)
plt.show()
No description has been provided for this image
In [35]:
# Create correlation matrix for dummy_variables
leads_encoded = pd.get_dummies(df, drop_first=True)

plt.figure(figsize=(20,10))
sns.heatmap(leads_encoded.corr(), annot=False, cmap="coolwarm")  # annot=False for clarity
plt.show()
No description has been provided for this image
In [36]:
# correlation matrix using top features
corr_matrix = leads_encoded.corr()

# pick top 20 features most correlated with Converted
top_features = corr_matrix["Converted"].abs().sort_values(ascending=False).head(20).index

plt.figure(figsize=(12,8))
sns.heatmap(corr_matrix.loc[top_features, top_features], annot=True, fmt=".2f")
plt.title("Top 20 Correlated Features")
plt.show()
No description has been provided for this image
In [37]:
# dropping the columns which are highly correlated
df.drop(['What is your current occupation_Unemployed', 'What is your current occupation_Working Professional'], axis=1, inplace=True)
In [38]:
# Get new dummy variables after dropping the higly correlated columns
leads_encoded_new = pd.get_dummies(df, drop_first=True)

plt.figure(figsize=(20,10))
sns.heatmap(leads_encoded.corr(), annot=False, cmap="coolwarm")  # annot=False for clarity
plt.show()
No description has been provided for this image
In [39]:
# correlation matrix
corr_matrix = leads_encoded_new.corr()
# pick top 20 features most correlated with Converted
top_features = corr_matrix["Converted"].abs().sort_values(ascending=False).head(20).index

plt.figure(figsize=(12,8))
sns.heatmap(corr_matrix.loc[top_features, top_features], annot=True, fmt=".2f")
plt.title("Top 20 Correlated Features")
plt.show()
No description has been provided for this image

Step-5: Split data into input and target variables¶

In [40]:
# splitting the data into input and target variables
x = df.drop(columns =['Converted','Lead Number'],axis=1)
y = df['Converted']
x.shape
Out[40]:
(9240, 161)
In [41]:
# Let's start with splitting data into train and test data

x_train,x_test,y_train,y_test = train_test_split(x,y, test_size=0.2,random_state= 42)

Step-6 : Feature Scaling¶

In [42]:
# Feature scaling using standard scaler
scaler = StandardScaler()
x_train[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']]= scaler.fit_transform(x_train[['TotalVisits',
                                                                                                             'Total Time Spent on Website',
                                                                                                             'Page Views Per Visit']])
In [43]:
x_train.head()
Out[43]:
Do Not Email Do Not Call TotalVisits Total Time Spent on Website Page Views Per Visit Search Magazine Newspaper Article X Education Forums Newspaper ... Last Notable Activity_Form Submitted on Website Last Notable Activity_Had a Phone Conversation Last Notable Activity_Modified Last Notable Activity_Olark Chat Conversation Last Notable Activity_Page Visited on Website Last Notable Activity_Resubscribed to emails Last Notable Activity_SMS Sent Last Notable Activity_Unreachable Last Notable Activity_Unsubscribed Last Notable Activity_View in browser link Clicked
6487 1 0 -0.433198 -0.454165 -0.149720 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
4759 0 0 -1.129102 -0.889097 -1.272692 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
4368 0 0 -0.085246 -0.168456 0.411765 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1467 0 0 0.262706 0.737805 0.973251 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
5517 0 0 -0.433198 -0.628866 -0.149720 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0

5 rows × 161 columns

Few Insights on the Variables affecting lead conversion¶

In [44]:
# Insight 1: Overall Conversion Rate
import matplotlib.pyplot as plt
import seaborn as sns
conversion_rate = y.mean() * 100
plt.figure(figsize=(5, 4))
y.value_counts().plot(kind='bar')
plt.title(f'Lead Conversion Distribution (Rate: {conversion_rate:.2f}%)')
plt.xlabel('Converted')
plt.ylabel('Count')
plt.show()
print(f"INSIGHT 1: Overall conversion rate is {conversion_rate:.2f}%, a benchmark for lead nurturing success.")
No description has been provided for this image
INSIGHT 1: Overall conversion rate is 38.54%, a benchmark for lead nurturing success.
In [49]:
# Insight 2: Conversion by Lead Source
lead_source_conv = df_copy.groupby('Lead Source')['Converted'].mean().sort_values(ascending=False)
plt.figure(figsize=(10, 4))
lead_source_conv.plot(kind='bar')
plt.title('Conversion Rate by Lead Source')
plt.ylabel('Conversion Rate')
plt.xticks(rotation=45, ha='right')
plt.show()
print("INSIGHT 2: High-converting sources should get priority in marketing budgets.")
No description has been provided for this image
INSIGHT 2: High-converting sources should get priority in marketing budgets.
In [51]:
# Insight 3: Conversion by Lead Origin
lead_origin_conv = df_copy.groupby('Lead Origin')['Converted'].mean().sort_values(ascending=False)
plt.figure(figsize=(8, 4))
lead_origin_conv.plot(kind='bar')
plt.title('Conversion Rate by Lead Origin')
plt.ylabel('Conversion Rate')
plt.xticks(rotation=45, ha='right')
plt.show()
print("INSIGHT 3: Lead Origin is a strong segmentation variable — optimize acquisition channels accordingly.")
No description has been provided for this image
INSIGHT 3: Lead Origin is a strong segmentation variable — optimize acquisition channels accordingly.
In [52]:
# Insight 4: Total Visits vs Conversion
if 'TotalVisits' in x.columns:
    df_visits = df.copy()
    df_visits['TotalVisits_bin'] = pd.cut(df_visits['TotalVisits'], bins=[0, 1, 3, 5, 10, np.inf], labels=['0-1', '2-3', '4-5', '6-10', '10+'])
    visit_conv = df_visits.groupby('TotalVisits_bin')['Converted'].mean()
    plt.figure(figsize=(7, 4))
    visit_conv.plot(kind='bar')
    plt.title('Conversion Rate by Total Visits')
    plt.ylabel('Conversion Rate')
    plt.show()
    print("INSIGHT 4: Visitors with more site visits have higher conversion probability — retarget low-visit users.")
No description has been provided for this image
INSIGHT 4: Visitors with more site visits have higher conversion probability — retarget low-visit users.
In [53]:
# Insight 5: Total Time Spent vs Conversion
if 'Total Time Spent on Website' in x.columns:
    plt.figure(figsize=(8, 4))
    df.boxplot(column='Total Time Spent on Website', by='Converted')
    plt.suptitle('')
    plt.title('Website Time vs Conversion')
    plt.ylabel('Total Time Spent')
    plt.show()
    print("INSIGHT 5: Higher engagement time strongly correlates with conversion.")
<Figure size 800x400 with 0 Axes>
No description has been provided for this image
INSIGHT 5: Higher engagement time strongly correlates with conversion.

Step -7: Model Building¶

In [54]:
import statsmodels.api as sm
In [55]:
# Logistic Regression Model
logm1 = sm.GLM(y_train, (sm.add_constant(x_train)), family = sm.families.Binomial())
logm1.fit().summary()
Out[55]:
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 7392
Model: GLM Df Residuals: 7245
Model Family: Binomial Df Model: 146
Link Function: Logit Scale: 1.0000
Method: IRLS Log-Likelihood: nan
Date: Fri, 26 Sep 2025 Deviance: 93393.
Time: 20:13:05 Pearson chi2: 4.57e+18
No. Iterations: 100 Pseudo R-squ. (CS): nan
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const 1.304e+16 9.4e+07 1.39e+08 0.000 1.3e+16 1.3e+16
Do Not Email -4.491e+14 4.14e+06 -1.09e+08 0.000 -4.49e+14 -4.49e+14
Do Not Call 7.764e+15 4.77e+07 1.63e+08 0.000 7.76e+15 7.76e+15
TotalVisits 8.281e+13 1.32e+06 6.29e+07 0.000 8.28e+13 8.28e+13
Total Time Spent on Website 4.253e+14 9.26e+05 4.6e+08 0.000 4.25e+14 4.25e+14
Page Views Per Visit -9.822e+13 1.44e+06 -6.8e+07 0.000 -9.82e+13 -9.82e+13
Search 7.833e+14 2.11e+07 3.72e+07 0.000 7.83e+14 7.83e+14
Magazine 83.2603 8.93e-07 9.33e+07 0.000 83.260 83.260
Newspaper Article 2.107e+15 6.73e+07 3.13e+07 0.000 2.11e+15 2.11e+15
X Education Forums -9.99e+15 1.08e+08 -9.27e+07 0.000 -9.99e+15 -9.99e+15
Newspaper -7.93e+15 6.73e+07 -1.18e+08 0.000 -7.93e+15 -7.93e+15
Digital Advertisement 5.979e+14 3.91e+07 1.53e+07 0.000 5.98e+14 5.98e+14
Through Recommendations 1.285e+15 2.84e+07 4.52e+07 0.000 1.28e+15 1.28e+15
Receive More Updates About Our Courses 75.4163 1.06e-06 7.14e+07 0.000 75.416 75.416
Update me on Supply Chain Content -368.3959 2.09e-06 -1.76e+08 0.000 -368.396 -368.396
Get updates on DM Content 172.3771 1.11e-06 1.56e+08 0.000 172.377 172.377
I agree to pay the amount through cheque -124.1646 1.6e-06 -7.76e+07 0.000 -124.165 -124.165
A free copy of Mastering The Interview -1.357e+14 2.52e+06 -5.38e+07 0.000 -1.36e+14 -1.36e+14
Lead Origin_Landing Page Submission -1.867e+14 2.81e+06 -6.65e+07 0.000 -1.87e+14 -1.87e+14
Lead Origin_Lead Add Form 8.759e+14 1.3e+07 6.73e+07 0.000 8.76e+14 8.76e+14
Lead Origin_Lead Import 2.215e+15 4.93e+07 4.49e+07 0.000 2.22e+15 2.22e+15
Lead Origin_Quick Add Form 1.19e+16 6.79e+07 1.75e+08 0.000 1.19e+16 1.19e+16
Lead Source_Direct Traffic -9.534e+14 4.2e+07 -2.27e+07 0.000 -9.53e+14 -9.53e+14
Lead Source_Facebook -3.326e+15 6.47e+07 -5.14e+07 0.000 -3.33e+15 -3.33e+15
Lead Source_Google -9.919e+14 4.2e+07 -2.36e+07 0.000 -9.92e+14 -9.92e+14
Lead Source_Live Chat 3.077e+14 6.21e+07 4.96e+06 0.000 3.08e+14 3.08e+14
Lead Source_NC_EDM 4.096e+15 7.92e+07 5.17e+07 0.000 4.1e+15 4.1e+15
Lead Source_Olark Chat -8.88e+14 4.21e+07 -2.11e+07 0.000 -8.88e+14 -8.88e+14
Lead Source_Organic Search -8.593e+14 4.2e+07 -2.04e+07 0.000 -8.59e+14 -8.59e+14
Lead Source_Pay per Click Ads -3.839e+15 7.93e+07 -4.84e+07 0.000 -3.84e+15 -3.84e+15
Lead Source_Press_Release -3.9298 3.8e-07 -1.03e+07 0.000 -3.930 -3.930
Lead Source_Reference -6.961e+14 4.01e+07 -1.74e+07 0.000 -6.96e+14 -6.96e+14
Lead Source_Referral Sites -7.789e+14 4.25e+07 -1.83e+07 0.000 -7.79e+14 -7.79e+14
Lead Source_Social Media -1.16e+15 6.35e+07 -1.83e+07 0.000 -1.16e+15 -1.16e+15
Lead Source_WeLearn 5.764e+15 7.92e+07 7.27e+07 0.000 5.76e+15 5.76e+15
Lead Source_Welingak Website -3.273e+14 4.06e+07 -8.07e+06 0.000 -3.27e+14 -3.27e+14
Lead Source_bing -4.644e+14 5.17e+07 -8.99e+06 0.000 -4.64e+14 -4.64e+14
Lead Source_blog -3.23e+15 7.93e+07 -4.08e+07 0.000 -3.23e+15 -3.23e+15
Lead Source_google -2.075e+15 5.18e+07 -4e+07 0.000 -2.07e+15 -2.07e+15
Lead Source_testone -3.733e+15 7.95e+07 -4.69e+07 0.000 -3.73e+15 -3.73e+15
Lead Source_welearnblog_Home -362.5757 1.95e-06 -1.86e+08 0.000 -362.576 -362.576
Lead Source_youtubechannel -1.07e+15 9.34e+07 -1.15e+07 0.000 -1.07e+15 -1.07e+15
Last Activity_Converted to Lead -1.525e+15 2.44e+07 -6.25e+07 0.000 -1.53e+15 -1.53e+15
Last Activity_Email Bounced -1.523e+15 2.48e+07 -6.13e+07 0.000 -1.52e+15 -1.52e+15
Last Activity_Email Link Clicked -1.307e+15 2.52e+07 -5.19e+07 0.000 -1.31e+15 -1.31e+15
Last Activity_Email Marked Spam -4.607e+15 4.15e+07 -1.11e+08 0.000 -4.61e+15 -4.61e+15
Last Activity_Email Opened -1.398e+15 2.41e+07 -5.8e+07 0.000 -1.4e+15 -1.4e+15
Last Activity_Email Received 2.695e+16 7.14e+07 3.78e+08 0.000 2.7e+16 2.7e+16
Last Activity_Form Submitted on Website -1.571e+15 2.51e+07 -6.26e+07 0.000 -1.57e+15 -1.57e+15
Last Activity_Had a Phone Conversation -1.539e+15 3.03e+07 -5.08e+07 0.000 -1.54e+15 -1.54e+15
Last Activity_Olark Chat Conversation -9.413e+14 2.42e+07 -3.89e+07 0.000 -9.41e+14 -9.41e+14
Last Activity_Page Visited on Website -1.594e+15 2.44e+07 -6.52e+07 0.000 -1.59e+15 -1.59e+15
Last Activity_Resubscribed to emails -60.4735 6.35e-07 -9.53e+07 0.000 -60.474 -60.474
Last Activity_SMS Sent -1.263e+15 2.42e+07 -5.21e+07 0.000 -1.26e+15 -1.26e+15
Last Activity_Unreachable -1.415e+15 2.61e+07 -5.43e+07 0.000 -1.42e+15 -1.42e+15
Last Activity_Unsubscribed -2.456e+15 3.11e+07 -7.91e+07 0.000 -2.46e+15 -2.46e+15
Last Activity_View in browser link Clicked -1.267e+15 4.14e+07 -3.06e+07 0.000 -1.27e+15 -1.27e+15
Last Activity_Visited Booth in Tradeshow 1.027e+15 7.21e+07 1.42e+07 0.000 1.03e+15 1.03e+15
Country_Australia 7.261e+12 5.27e+07 1.38e+05 0.000 7.26e+12 7.26e+12
Country_Bahrain 1.462e+15 5.63e+07 2.59e+07 0.000 1.46e+15 1.46e+15
Country_Bangladesh 6.949e+14 6.73e+07 1.03e+07 0.000 6.95e+14 6.95e+14
Country_Belgium -6.948e+15 6.74e+07 -1.03e+08 0.000 -6.95e+15 -6.95e+15
Country_Canada -3.783e+14 6.74e+07 -5.61e+06 0.000 -3.78e+14 -3.78e+14
Country_China -3.181e+15 6.72e+07 -4.73e+07 0.000 -3.18e+15 -3.18e+15
Country_Denmark 1.59e+15 8.24e+07 1.93e+07 0.000 1.59e+15 1.59e+15
Country_France 3.332e+14 5.64e+07 5.91e+06 0.000 3.33e+14 3.33e+14
Country_Germany 6.687e+14 5.83e+07 1.15e+07 0.000 6.69e+14 6.69e+14
Country_Ghana -2.716e+15 8.23e+07 -3.3e+07 0.000 -2.72e+15 -2.72e+15
Country_Hong Kong 1.192e+15 5.49e+07 2.17e+07 0.000 1.19e+15 1.19e+15
Country_India 2.573e+14 4.75e+07 5.41e+06 0.000 2.57e+14 2.57e+14
Country_Indonesia 27.5231 3.43e-07 8.02e+07 0.000 27.523 27.523
Country_Italy -7.526e+15 6.76e+07 -1.11e+08 0.000 -7.53e+15 -7.53e+15
Country_Kenya -4.678e+15 8.24e+07 -5.68e+07 0.000 -4.68e+15 -4.68e+15
Country_Kuwait 3.061e+14 6.15e+07 4.98e+06 0.000 3.06e+14 3.06e+14
Country_Liberia -2.553e+15 8.26e+07 -3.09e+07 0.000 -2.55e+15 -2.55e+15
Country_Malaysia -5.8e+14 8.28e+07 -7.01e+06 0.000 -5.8e+14 -5.8e+14
Country_Netherlands 2.155e+15 8.24e+07 2.62e+07 0.000 2.16e+15 2.16e+15
Country_Nigeria -1.658e+15 6.15e+07 -2.7e+07 0.000 -1.66e+15 -1.66e+15
Country_Oman 5.073e+14 6.15e+07 8.25e+06 0.000 5.07e+14 5.07e+14
Country_Philippines -2.038e+15 8.24e+07 -2.47e+07 0.000 -2.04e+15 -2.04e+15
Country_Qatar -9.695e+12 5.26e+07 -1.84e+05 0.000 -9.7e+12 -9.7e+12
Country_Russia 17.7277 1.21e-07 1.46e+08 0.000 17.728 17.728
Country_Saudi Arabia 3.506e+14 5.03e+07 6.97e+06 0.000 3.51e+14 3.51e+14
Country_Singapore 7.518e+14 5e+07 1.5e+07 0.000 7.52e+14 7.52e+14
Country_South Africa -7.621e+15 6.74e+07 -1.13e+08 0.000 -7.62e+15 -7.62e+15
Country_Sri Lanka -4.199e+15 8.26e+07 -5.08e+07 0.000 -4.2e+15 -4.2e+15
Country_Sweden -2.908e+14 6.13e+07 -4.74e+06 0.000 -2.91e+14 -2.91e+14
Country_Switzerland -3.688e+15 8.24e+07 -4.48e+07 0.000 -3.69e+15 -3.69e+15
Country_Tanzania -6.373e+15 8.26e+07 -7.71e+07 0.000 -6.37e+15 -6.37e+15
Country_Uganda -8.12e+15 6.73e+07 -1.21e+08 0.000 -8.12e+15 -8.12e+15
Country_United Arab Emirates 4.267e+14 4.87e+07 8.77e+06 0.000 4.27e+14 4.27e+14
Country_United Kingdom 5.658e+14 5.14e+07 1.1e+07 0.000 5.66e+14 5.66e+14
Country_United States 5.65e+14 4.84e+07 1.17e+07 0.000 5.65e+14 5.65e+14
Country_Vietnam 5.7824 1.07e-07 5.39e+07 0.000 5.782 5.782
Country_unknown 1.154e+15 6.14e+07 1.88e+07 0.000 1.15e+15 1.15e+15
Specialization_Business Administration -2.254e+13 5.57e+06 -4.05e+06 0.000 -2.25e+13 -2.25e+13
Specialization_E-Business -4.093e+14 1.07e+07 -3.81e+07 0.000 -4.09e+14 -4.09e+14
Specialization_E-COMMERCE -3.287e+14 8.23e+06 -3.99e+07 0.000 -3.29e+14 -3.29e+14
Specialization_Finance Management -5.233e+14 4.39e+06 -1.19e+08 0.000 -5.23e+14 -5.23e+14
Specialization_Healthcare Management -1.034e+14 7.29e+06 -1.42e+07 0.000 -1.03e+14 -1.03e+14
Specialization_Hospitality Management -3.198e+14 8.18e+06 -3.91e+07 0.000 -3.2e+14 -3.2e+14
Specialization_Human Resource Management -1.586e+14 4.79e+06 -3.31e+07 0.000 -1.59e+14 -1.59e+14
Specialization_IT Projects Management -5.802e+13 5.63e+06 -1.03e+07 0.000 -5.8e+13 -5.8e+13
Specialization_International Business -3.956e+14 6.86e+06 -5.77e+07 0.000 -3.96e+14 -3.96e+14
Specialization_Marketing Management -5.792e+13 4.78e+06 -1.21e+07 0.000 -5.79e+13 -5.79e+13
Specialization_Media and Advertising -1.647e+14 6.62e+06 -2.49e+07 0.000 -1.65e+14 -1.65e+14
Specialization_Operations Management -9.508e+13 5.25e+06 -1.81e+07 0.000 -9.51e+13 -9.51e+13
Specialization_Retail Management -3.182e+14 8.47e+06 -3.76e+07 0.000 -3.18e+14 -3.18e+14
Specialization_Rural and Agribusiness -1.929e+14 9.88e+06 -1.95e+07 0.000 -1.93e+14 -1.93e+14
Specialization_Services Excellence -3.007e+14 1.23e+07 -2.45e+07 0.000 -3.01e+14 -3.01e+14
Specialization_Supply Chain Management -2.245e+14 5.76e+06 -3.89e+07 0.000 -2.24e+14 -2.24e+14
Specialization_Travel and Tourism -1.29e+14 6.71e+06 -1.92e+07 0.000 -1.29e+14 -1.29e+14
What is your current occupation_Housewife 3.677e+15 2.14e+07 1.72e+08 0.000 3.68e+15 3.68e+15
What is your current occupation_Other 4.496e+14 2.14e+07 2.1e+07 0.000 4.5e+14 4.5e+14
What is your current occupation_Student 4.869e+14 5.72e+06 8.51e+07 0.000 4.87e+14 4.87e+14
What matters most to you in choosing a course_Flexibility & Convenience 1.665e+15 6.75e+07 2.47e+07 0.000 1.66e+15 1.66e+15
What matters most to you in choosing a course_Other -3.438e+15 6.77e+07 -5.08e+07 0.000 -3.44e+15 -3.44e+15
Tags_Busy 1.294e+15 6.9e+06 1.88e+08 0.000 1.29e+15 1.29e+15
Tags_Closed by Horizzon 2.339e+15 5.91e+06 3.96e+08 0.000 2.34e+15 2.34e+15
Tags_Diploma holder (Not Eligible) -2.987e+15 1.08e+07 -2.76e+08 0.000 -2.99e+15 -2.99e+15
Tags_Graduation in progress 2.658e+14 8.2e+06 3.24e+07 0.000 2.66e+14 2.66e+14
Tags_In confusion whether part time or DLP 1.269e+15 3.38e+07 3.75e+07 0.000 1.27e+15 1.27e+15
Tags_Interested in full time MBA 1.02e+14 8.07e+06 1.26e+07 0.000 1.02e+14 1.02e+14
Tags_Interested in Next batch 4.409e+15 3.04e+07 1.45e+08 0.000 4.41e+15 4.41e+15
Tags_Interested in other courses 7.297e+12 5e+06 1.46e+06 0.000 7.3e+12 7.3e+12
Tags_Lateral student 6.567e+15 6.73e+07 9.75e+07 0.000 6.57e+15 6.57e+15
Tags_Lost to EINS 2.374e+15 6.77e+06 3.5e+08 0.000 2.37e+15 2.37e+15
Tags_Lost to Others -2.498e+14 2.8e+07 -8.92e+06 0.000 -2.5e+14 -2.5e+14
Tags_Not doing further education -5.533e+14 7.38e+06 -7.49e+07 0.000 -5.53e+14 -5.53e+14
Tags_Recognition issue (DEC approval) 7.9648 3.36e-08 2.37e+08 0.000 7.965 7.965
Tags_Ringing -3.555e+14 4.35e+06 -8.16e+07 0.000 -3.55e+14 -3.55e+14
Tags_Shall take in the next coming month -3.932e+15 6.73e+07 -5.84e+07 0.000 -3.93e+15 -3.93e+15
Tags_Still Thinking -1.798e+15 3.04e+07 -5.92e+07 0.000 -1.8e+15 -1.8e+15
Tags_University not recognized -1.259e+15 4.79e+07 -2.63e+07 0.000 -1.26e+15 -1.26e+15
Tags_Want to take admission but has financial problems 5.213e+14 2.87e+07 1.82e+07 0.000 5.21e+14 5.21e+14
Tags_Will revert after reading the email 7.926e+14 3.87e+06 2.05e+08 0.000 7.93e+14 7.93e+14
Tags_in touch with EINS 8.055e+14 2.06e+07 3.91e+07 0.000 8.06e+14 8.06e+14
Tags_invalid number -4.604e+14 9.19e+06 -5.01e+07 0.000 -4.6e+14 -4.6e+14
Tags_number not provided -8.383e+14 1.44e+07 -5.82e+07 0.000 -8.38e+14 -8.38e+14
Tags_opp hangup 2.122e+14 1.25e+07 1.69e+07 0.000 2.12e+14 2.12e+14
Tags_switched off -8.503e+14 6.19e+06 -1.37e+08 0.000 -8.5e+14 -8.5e+14
Tags_wrong number given -4.881e+14 1.12e+07 -4.35e+07 0.000 -4.88e+14 -4.88e+14
City_Other Cities 8.205e+13 3.22e+06 2.55e+07 0.000 8.21e+13 8.21e+13
City_Other Cities of Maharashtra 4.574e+13 3.81e+06 1.2e+07 0.000 4.57e+13 4.57e+13
City_Other Metro Cities 2.513e+13 4.21e+06 5.96e+06 0.000 2.51e+13 2.51e+13
City_Thane & Outskirts 1.423e+14 3e+06 4.74e+07 0.000 1.42e+14 1.42e+14
City_Tier II Cities 2.168e+14 9.03e+06 2.4e+07 0.000 2.17e+14 2.17e+14
Last Notable Activity_Email Bounced -1.124e+16 7.41e+07 -1.52e+08 0.000 -1.12e+16 -1.12e+16
Last Notable Activity_Email Link Clicked -1.182e+16 7.39e+07 -1.6e+08 0.000 -1.18e+16 -1.18e+16
Last Notable Activity_Email Marked Spam -4.607e+15 4.15e+07 -1.11e+08 0.000 -4.61e+15 -4.61e+15
Last Notable Activity_Email Opened -1.21e+16 7.33e+07 -1.65e+08 0.000 -1.21e+16 -1.21e+16
Last Notable Activity_Email Received -3.35e+16 1.2e+08 -2.79e+08 0.000 -3.35e+16 -3.35e+16
Last Notable Activity_Form Submitted on Website 0 0 nan nan 0 0
Last Notable Activity_Had a Phone Conversation -1.088e+16 7.79e+07 -1.4e+08 0.000 -1.09e+16 -1.09e+16
Last Notable Activity_Modified -1.186e+16 7.33e+07 -1.62e+08 0.000 -1.19e+16 -1.19e+16
Last Notable Activity_Olark Chat Conversation -1.227e+16 7.35e+07 -1.67e+08 0.000 -1.23e+16 -1.23e+16
Last Notable Activity_Page Visited on Website -1.154e+16 7.35e+07 -1.57e+08 0.000 -1.15e+16 -1.15e+16
Last Notable Activity_Resubscribed to emails 0 0 nan nan 0 0
Last Notable Activity_SMS Sent -1.076e+16 7.34e+07 -1.47e+08 0.000 -1.08e+16 -1.08e+16
Last Notable Activity_Unreachable -1.101e+16 7.5e+07 -1.47e+08 0.000 -1.1e+16 -1.1e+16
Last Notable Activity_Unsubscribed -1.05e+16 7.67e+07 -1.37e+08 0.000 -1.05e+16 -1.05e+16
Last Notable Activity_View in browser link Clicked -1.141e+16 1.05e+08 -1.09e+08 0.000 -1.14e+16 -1.14e+16

Feature selection by using RFE¶

In [56]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(max_iter=1000)
In [57]:
from sklearn.feature_selection import RFE
rfe = RFE(logreg, n_features_to_select=15)
rfe = rfe.fit(x_train,y_train)
In [58]:
rfe.support_
Out[58]:
array([ True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
        True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False,  True, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False,  True,
        True, False, False, False, False, False, False, False,  True,
       False, False, False,  True, False, False, False, False,  True,
       False,  True,  True, False,  True,  True, False, False, False,
       False, False, False, False, False, False, False, False,  True,
       False, False, False, False,  True, False,  True, False])
In [59]:
list(zip(x_train.columns, rfe.support_, rfe.ranking_))
Out[59]:
[('Do Not Email', True, 1),
 ('Do Not Call', False, 99),
 ('TotalVisits', False, 81),
 ('Total Time Spent on Website', False, 2),
 ('Page Views Per Visit', False, 80),
 ('Search', False, 110),
 ('Magazine', False, 134),
 ('Newspaper Article', False, 65),
 ('X Education Forums', False, 64),
 ('Newspaper', False, 118),
 ('Digital Advertisement', False, 79),
 ('Through Recommendations', False, 49),
 ('Receive More Updates About Our Courses', False, 144),
 ('Update me on Supply Chain Content', False, 146),
 ('Get updates on DM Content', False, 141),
 ('I agree to pay the amount through cheque', False, 136),
 ('A free copy of Mastering The Interview', False, 57),
 ('Lead Origin_Landing Page Submission', False, 9),
 ('Lead Origin_Lead Add Form', True, 1),
 ('Lead Origin_Lead Import', False, 26),
 ('Lead Origin_Quick Add Form', False, 73),
 ('Lead Source_Direct Traffic', False, 78),
 ('Lead Source_Facebook', False, 100),
 ('Lead Source_Google', False, 85),
 ('Lead Source_Live Chat', False, 114),
 ('Lead Source_NC_EDM', False, 67),
 ('Lead Source_Olark Chat', False, 27),
 ('Lead Source_Organic Search', False, 101),
 ('Lead Source_Pay per Click Ads', False, 127),
 ('Lead Source_Press_Release', False, 142),
 ('Lead Source_Reference', False, 40),
 ('Lead Source_Referral Sites', False, 84),
 ('Lead Source_Social Media', False, 37),
 ('Lead Source_WeLearn', False, 105),
 ('Lead Source_Welingak Website', True, 1),
 ('Lead Source_bing', False, 53),
 ('Lead Source_blog', False, 45),
 ('Lead Source_google', False, 34),
 ('Lead Source_testone', False, 123),
 ('Lead Source_welearnblog_Home', False, 140),
 ('Lead Source_youtubechannel', False, 56),
 ('Last Activity_Converted to Lead', False, 47),
 ('Last Activity_Email Bounced', False, 21),
 ('Last Activity_Email Link Clicked', False, 108),
 ('Last Activity_Email Marked Spam', False, 66),
 ('Last Activity_Email Opened', False, 121),
 ('Last Activity_Email Received', False, 29),
 ('Last Activity_Form Submitted on Website', False, 111),
 ('Last Activity_Had a Phone Conversation', False, 12),
 ('Last Activity_Olark Chat Conversation', False, 24),
 ('Last Activity_Page Visited on Website', False, 63),
 ('Last Activity_Resubscribed to emails', False, 138),
 ('Last Activity_SMS Sent', False, 14),
 ('Last Activity_Unreachable', False, 106),
 ('Last Activity_Unsubscribed', False, 82),
 ('Last Activity_View in browser link Clicked', False, 86),
 ('Last Activity_Visited Booth in Tradeshow', False, 133),
 ('Country_Australia', False, 116),
 ('Country_Bahrain', False, 74),
 ('Country_Bangladesh', False, 44),
 ('Country_Belgium', False, 41),
 ('Country_Canada', False, 35),
 ('Country_China', False, 88),
 ('Country_Denmark', False, 122),
 ('Country_France', False, 87),
 ('Country_Germany', False, 28),
 ('Country_Ghana', False, 75),
 ('Country_Hong Kong', False, 30),
 ('Country_India', False, 72),
 ('Country_Indonesia', False, 143),
 ('Country_Italy', False, 11),
 ('Country_Kenya', False, 129),
 ('Country_Kuwait', False, 95),
 ('Country_Liberia', False, 115),
 ('Country_Malaysia', False, 126),
 ('Country_Netherlands', False, 102),
 ('Country_Nigeria', False, 76),
 ('Country_Oman', False, 96),
 ('Country_Philippines', False, 128),
 ('Country_Qatar', False, 61),
 ('Country_Russia', False, 147),
 ('Country_Saudi Arabia', False, 22),
 ('Country_Singapore', False, 42),
 ('Country_South Africa', False, 125),
 ('Country_Sri Lanka', False, 132),
 ('Country_Sweden', False, 107),
 ('Country_Switzerland', False, 46),
 ('Country_Tanzania', False, 131),
 ('Country_Uganda', False, 120),
 ('Country_United Arab Emirates', False, 69),
 ('Country_United Kingdom', False, 18),
 ('Country_United States', False, 36),
 ('Country_Vietnam', False, 135),
 ('Country_unknown', False, 117),
 ('Specialization_Business Administration', False, 62),
 ('Specialization_E-Business', False, 48),
 ('Specialization_E-COMMERCE', False, 70),
 ('Specialization_Finance Management', False, 8),
 ('Specialization_Healthcare Management', False, 93),
 ('Specialization_Hospitality Management', False, 55),
 ('Specialization_Human Resource Management', False, 92),
 ('Specialization_IT Projects Management', False, 112),
 ('Specialization_International Business', False, 43),
 ('Specialization_Marketing Management', False, 130),
 ('Specialization_Media and Advertising', False, 90),
 ('Specialization_Operations Management', False, 103),
 ('Specialization_Retail Management', False, 50),
 ('Specialization_Rural and Agribusiness', False, 104),
 ('Specialization_Services Excellence', False, 91),
 ('Specialization_Supply Chain Management', False, 89),
 ('Specialization_Travel and Tourism', False, 94),
 ('What is your current occupation_Housewife', False, 4),
 ('What is your current occupation_Other', False, 17),
 ('What is your current occupation_Student', False, 13),
 ('What matters most to you in choosing a course_Flexibility & Convenience',
  False,
  113),
 ('What matters most to you in choosing a course_Other', False, 124),
 ('Tags_Busy', True, 1),
 ('Tags_Closed by Horizzon', True, 1),
 ('Tags_Diploma holder (Not Eligible)', False, 19),
 ('Tags_Graduation in progress', False, 32),
 ('Tags_In confusion whether part time or DLP', False, 16),
 ('Tags_Interested  in full time MBA', False, 52),
 ('Tags_Interested in Next batch', False, 10),
 ('Tags_Interested in other courses', False, 38),
 ('Tags_Lateral student', False, 15),
 ('Tags_Lost to EINS', True, 1),
 ('Tags_Lost to Others', False, 54),
 ('Tags_Not doing further education', False, 31),
 ('Tags_Recognition issue (DEC approval)', False, 145),
 ('Tags_Ringing', True, 1),
 ('Tags_Shall take in the next coming month', False, 97),
 ('Tags_Still Thinking', False, 51),
 ('Tags_University not recognized', False, 109),
 ('Tags_Want to take admission but has financial problems', False, 23),
 ('Tags_Will revert after reading the email', True, 1),
 ('Tags_in touch with EINS', False, 5),
 ('Tags_invalid number', True, 1),
 ('Tags_number not provided', True, 1),
 ('Tags_opp hangup', False, 60),
 ('Tags_switched off', True, 1),
 ('Tags_wrong number given', True, 1),
 ('City_Other Cities', False, 59),
 ('City_Other Cities of Maharashtra', False, 98),
 ('City_Other Metro Cities', False, 77),
 ('City_Thane & Outskirts', False, 68),
 ('City_Tier II Cities', False, 58),
 ('Last Notable Activity_Email Bounced', False, 20),
 ('Last Notable Activity_Email Link Clicked', False, 25),
 ('Last Notable Activity_Email Marked Spam', False, 71),
 ('Last Notable Activity_Email Opened', False, 39),
 ('Last Notable Activity_Email Received', False, 83),
 ('Last Notable Activity_Form Submitted on Website', False, 137),
 ('Last Notable Activity_Had a Phone Conversation', True, 1),
 ('Last Notable Activity_Modified', False, 6),
 ('Last Notable Activity_Olark Chat Conversation', False, 3),
 ('Last Notable Activity_Page Visited on Website', False, 33),
 ('Last Notable Activity_Resubscribed to emails', False, 139),
 ('Last Notable Activity_SMS Sent', True, 1),
 ('Last Notable Activity_Unreachable', False, 7),
 ('Last Notable Activity_Unsubscribed', True, 1),
 ('Last Notable Activity_View in browser link Clicked', False, 119)]
In [60]:
# columns which are selected from RFE model
col = x_train.columns[rfe.support_]
col
Out[60]:
Index(['Do Not Email', 'Lead Origin_Lead Add Form',
       'Lead Source_Welingak Website', 'Tags_Busy', 'Tags_Closed by Horizzon',
       'Tags_Lost to EINS', 'Tags_Ringing',
       'Tags_Will revert after reading the email', 'Tags_invalid number',
       'Tags_number not provided', 'Tags_switched off',
       'Tags_wrong number given',
       'Last Notable Activity_Had a Phone Conversation',
       'Last Notable Activity_SMS Sent', 'Last Notable Activity_Unsubscribed'],
      dtype='object')

Assessing the model with statsmodel¶

In [61]:
x_train_sm = sm.add_constant(x_train[col])
logm2 = sm.GLM(y_train,x_train_sm, sm.families.Binomial())
res = logm2.fit()
res.summary()
Out[61]:
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 7392
Model: GLM Df Residuals: 7376
Model Family: Binomial Df Model: 15
Link Function: Logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2613.0
Date: Fri, 26 Sep 2025 Deviance: 5225.9
Time: 20:13:12 Pearson chi2: 1.25e+04
No. Iterations: 23 Pseudo R-squ. (CS): 0.4635
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -3.6865 0.176 -20.980 0.000 -4.031 -3.342
Do Not Email -1.5251 0.170 -8.952 0.000 -1.859 -1.191
Lead Origin_Lead Add Form 2.4149 0.209 11.537 0.000 2.005 2.825
Lead Source_Welingak Website 1.9829 1.034 1.917 0.055 -0.045 4.010
Tags_Busy 2.7654 0.273 10.141 0.000 2.231 3.300
Tags_Closed by Horizzon 8.0383 0.734 10.959 0.000 6.601 9.476
Tags_Lost to EINS 7.2176 0.539 13.394 0.000 6.161 8.274
Tags_Ringing -1.7335 0.281 -6.170 0.000 -2.284 -1.183
Tags_Will revert after reading the email 3.0988 0.178 17.372 0.000 2.749 3.448
Tags_invalid number -2.0582 1.040 -1.980 0.048 -4.096 -0.021
Tags_number not provided -22.0940 2.21e+04 -0.001 0.999 -4.33e+04 4.33e+04
Tags_switched off -2.5735 0.741 -3.472 0.001 -4.026 -1.121
Tags_wrong number given -21.9906 1.73e+04 -0.001 0.999 -3.39e+04 3.38e+04
Last Notable Activity_Had a Phone Conversation 3.7106 1.169 3.175 0.001 1.420 6.001
Last Notable Activity_SMS Sent 2.8038 0.102 27.602 0.000 2.605 3.003
Last Notable Activity_Unsubscribed 2.1135 0.507 4.171 0.000 1.120 3.107
In [62]:
# Make predictions on the training data
y_train_pred = res.predict(x_train_sm)
y_train_pred
Out[62]:
6487    0.015654
4759    0.357183
4368    0.024448
1467    0.901694
5517    0.024448
          ...   
5734    0.004408
5191    0.357183
5390    0.990350
860     0.004408
7270    0.357183
Length: 7392, dtype: float64
In [63]:
# Reshape values 
y_train_pred = y_train_pred.values.reshape(-1)
y_train_pred[:10]
Out[63]:
array([0.01565423, 0.35718324, 0.02444791, 0.90169386, 0.02444791,
       0.00440771, 0.35718324, 0.35718324, 0.97155985, 0.02444791])
Creating a dataframe with the actual converted value and the predicted probabilities¶
In [64]:
y_train_pred_final = pd.DataFrame({'Converted':y_train.values,'converted_prob':y_train_pred})
y_train_pred_final['Prospect ID'] = y_train.index
y_train_pred_final
Out[64]:
Converted converted_prob Prospect ID
0 0.0 0.015654 6487
1 0.0 0.357183 4759
2 0.0 0.024448 4368
3 1.0 0.901694 1467
4 0.0 0.024448 5517
... ... ... ...
7387 0.0 0.004408 5734
7388 0.0 0.357183 5191
7389 1.0 0.990350 5390
7390 0.0 0.004408 860
7391 1.0 0.357183 7270

7392 rows × 3 columns

Creating new column 'predicted' with 1 if Churn_Prob > 0.5 else 0¶
In [65]:
y_train_pred_final['Predicted'] = y_train_pred_final.converted_prob.map(lambda x : 1 if x>0.5 else 0)
y_train_pred_final
Out[65]:
Converted converted_prob Prospect ID Predicted
0 0.0 0.015654 6487 0
1 0.0 0.357183 4759 0
2 0.0 0.024448 4368 0
3 1.0 0.901694 1467 1
4 0.0 0.024448 5517 0
... ... ... ... ...
7387 0.0 0.004408 5734 0
7388 0.0 0.357183 5191 0
7389 1.0 0.990350 5390 1
7390 0.0 0.004408 860 0
7391 1.0 0.357183 7270 0

7392 rows × 4 columns

Confusion matrix¶

In [66]:
from sklearn import metrics
confusion_matrix = metrics.confusion_matrix(y_train_pred_final.Converted,y_train_pred_final.Predicted)
confusion_matrix
Out[66]:
array([[4428,  144],
       [1060, 1760]], dtype=int64)
In [67]:
accuracy_score = metrics.accuracy_score(y_train_pred_final.Converted,y_train_pred_final.Predicted)
accuracy_score
Out[67]:
0.8371212121212122

Checking VIFs¶

In [68]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif['Features'] = x_train[col].columns
vif['VIF'] = [variance_inflation_factor(x_train[col].values,i) for i in range(x_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF',ascending = False)
vif
Out[68]:
Features VIF
1 Lead Origin_Lead Add Form 1.49
13 Last Notable Activity_SMS Sent 1.44
7 Tags_Will revert after reading the email 1.37
2 Lead Source_Welingak Website 1.22
4 Tags_Closed by Horizzon 1.16
0 Do Not Email 1.14
6 Tags_Ringing 1.09
14 Last Notable Activity_Unsubscribed 1.06
3 Tags_Busy 1.03
10 Tags_switched off 1.03
8 Tags_invalid number 1.01
9 Tags_number not provided 1.01
11 Tags_wrong number given 1.01
5 Tags_Lost to EINS 1.00
12 Last Notable Activity_Had a Phone Conversation 1.00
In [69]:
col = col.drop('Tags_number not provided',1)
In [70]:
x_train_sm = sm.add_constant(x_train[col])
logm3 = sm.GLM(y_train,x_train_sm, sm.families.Binomial())
res = logm3.fit()
res.summary()
Out[70]:
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 7392
Model: GLM Df Residuals: 7377
Model Family: Binomial Df Model: 14
Link Function: Logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2615.4
Date: Fri, 26 Sep 2025 Deviance: 5230.8
Time: 20:13:17 Pearson chi2: 1.25e+04
No. Iterations: 23 Pseudo R-squ. (CS): 0.4631
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -3.7498 0.176 -21.259 0.000 -4.095 -3.404
Do Not Email -1.5271 0.170 -8.982 0.000 -1.860 -1.194
Lead Origin_Lead Add Form 2.4195 0.210 11.537 0.000 2.008 2.831
Lead Source_Welingak Website 1.9780 1.035 1.912 0.056 -0.050 4.006
Tags_Busy 2.8342 0.272 10.407 0.000 2.300 3.368
Tags_Closed by Horizzon 8.1015 0.734 11.042 0.000 6.663 9.539
Tags_Lost to EINS 7.2810 0.539 13.507 0.000 6.224 8.338
Tags_Ringing -1.6580 0.280 -5.919 0.000 -2.207 -1.109
Tags_Will revert after reading the email 3.1640 0.179 17.698 0.000 2.814 3.514
Tags_invalid number -1.9821 1.039 -1.907 0.057 -4.019 0.055
Tags_switched off -2.4968 0.741 -3.371 0.001 -3.949 -1.045
Tags_wrong number given -21.9152 1.73e+04 -0.001 0.999 -3.39e+04 3.39e+04
Last Notable Activity_Had a Phone Conversation 3.7311 1.176 3.171 0.002 1.425 6.037
Last Notable Activity_SMS Sent 2.7890 0.101 27.670 0.000 2.591 2.987
Last Notable Activity_Unsubscribed 2.1155 0.507 4.172 0.000 1.122 3.109
In [71]:
y_train_pred = res.predict(x_train_sm)
y_train_pred
Out[71]:
6487    0.015583
4759    0.357612
4368    0.022982
1467    0.900540
5517    0.022982
          ...   
5734    0.004461
5191    0.357612
5390    0.990270
860     0.004461
7270    0.357612
Length: 7392, dtype: float64
In [72]:
y_train_pred = y_train_pred.values.reshape(-1)
y_train_pred[:10]
Out[72]:
array([0.01558295, 0.35761196, 0.02298228, 0.90053976, 0.02298228,
       0.00446149, 0.35761196, 0.35761196, 0.97156333, 0.02298228])
In [73]:
# y_train_pred_final = pd.DataFrame({'Converted':y_train.values,'converted_prob':y_train_pred})
# y_train_pred_final['Prospect ID'] = y_train.index
# y_train_pred_final

y_train_pred_final['converted_prob'] = y_train_pred
In [74]:
y_train_pred_final['Predicted'] = y_train_pred_final.converted_prob.map(lambda x : 1 if x>0.5 else 0)
y_train_pred_final
Out[74]:
Converted converted_prob Prospect ID Predicted
0 0.0 0.015583 6487 0
1 0.0 0.357612 4759 0
2 0.0 0.022982 4368 0
3 1.0 0.900540 1467 1
4 0.0 0.022982 5517 0
... ... ... ... ...
7387 0.0 0.004461 5734 0
7388 0.0 0.357612 5191 0
7389 1.0 0.990270 5390 1
7390 0.0 0.004461 860 0
7391 1.0 0.357612 7270 0

7392 rows × 4 columns

In [75]:
from sklearn import metrics
confusion_matrix2 = metrics.confusion_matrix(y_train_pred_final.Converted,y_train_pred_final.Predicted)
confusion_matrix2
Out[75]:
array([[4428,  144],
       [1061, 1759]], dtype=int64)
In [76]:
accuracy_score = metrics.accuracy_score(y_train_pred_final.Converted,y_train_pred_final.Predicted)
accuracy_score
Out[76]:
0.8369859307359307
In [77]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif['Features'] = x_train[col].columns
vif['VIF'] = [variance_inflation_factor(x_train[col].values,i) for i in range(x_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF',ascending = False)
vif
Out[77]:
Features VIF
1 Lead Origin_Lead Add Form 1.49
12 Last Notable Activity_SMS Sent 1.44
7 Tags_Will revert after reading the email 1.37
2 Lead Source_Welingak Website 1.22
4 Tags_Closed by Horizzon 1.16
0 Do Not Email 1.13
6 Tags_Ringing 1.09
13 Last Notable Activity_Unsubscribed 1.06
3 Tags_Busy 1.03
9 Tags_switched off 1.03
8 Tags_invalid number 1.01
10 Tags_wrong number given 1.01
5 Tags_Lost to EINS 1.00
11 Last Notable Activity_Had a Phone Conversation 1.00
In [78]:
# Let's drop the column 'Tags_wrong number given' which has high p-value
col = col.drop('Tags_wrong number given')
col
Out[78]:
Index(['Do Not Email', 'Lead Origin_Lead Add Form',
       'Lead Source_Welingak Website', 'Tags_Busy', 'Tags_Closed by Horizzon',
       'Tags_Lost to EINS', 'Tags_Ringing',
       'Tags_Will revert after reading the email', 'Tags_invalid number',
       'Tags_switched off', 'Last Notable Activity_Had a Phone Conversation',
       'Last Notable Activity_SMS Sent', 'Last Notable Activity_Unsubscribed'],
      dtype='object')
In [79]:
x_train_sm = sm.add_constant(x_train[col])
logm4 = sm.GLM(y_train,x_train_sm, sm.families.Binomial())
res = logm4.fit()
res.summary()
Out[79]:
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 7392
Model: GLM Df Residuals: 7378
Model Family: Binomial Df Model: 13
Link Function: Logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2618.8
Date: Fri, 26 Sep 2025 Deviance: 5237.6
Time: 20:13:22 Pearson chi2: 1.25e+04
No. Iterations: 8 Pseudo R-squ. (CS): 0.4626
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -3.8382 0.177 -21.672 0.000 -4.185 -3.491
Do Not Email -1.5355 0.169 -9.065 0.000 -1.868 -1.204
Lead Origin_Lead Add Form 2.4263 0.210 11.537 0.000 2.014 2.838
Lead Source_Welingak Website 1.9712 1.035 1.905 0.057 -0.057 3.999
Tags_Busy 2.9297 0.272 10.779 0.000 2.397 3.462
Tags_Closed by Horizzon 8.1899 0.734 11.160 0.000 6.752 9.628
Tags_Lost to EINS 7.3701 0.539 13.666 0.000 6.313 8.427
Tags_Ringing -1.5545 0.279 -5.572 0.000 -2.101 -1.008
Tags_Will revert after reading the email 3.2550 0.179 18.173 0.000 2.904 3.606
Tags_invalid number -1.8778 1.039 -1.807 0.071 -3.914 0.158
Tags_switched off -2.3915 0.740 -3.231 0.001 -3.842 -0.941
Last Notable Activity_Had a Phone Conversation 3.7602 1.188 3.165 0.002 1.432 6.089
Last Notable Activity_SMS Sent 2.7706 0.100 27.736 0.000 2.575 2.966
Last Notable Activity_Unsubscribed 2.1239 0.508 4.185 0.000 1.129 3.119
In [80]:
y_train_pred = res.predict(x_train_sm).values.reshape(-1)
y_train_pred[:10]
Out[80]:
array([0.01540345, 0.35820124, 0.02107786, 0.89911781, 0.02107786,
       0.00452884, 0.35820124, 0.35820124, 0.97158096, 0.02107786])
In [81]:
y_train_pred_final['converted_prob'] = y_train_pred
In [82]:
# Creating new column 'predicted' with 1 if Churn_Prob > 0.5 else 0
y_train_pred_final['Predicted'] = y_train_pred_final.converted_prob.map(lambda x: 1 if x > 0.5 else 0)
y_train_pred_final.head()
Out[82]:
Converted converted_prob Prospect ID Predicted
0 0.0 0.015403 6487 0
1 0.0 0.358201 4759 0
2 0.0 0.021078 4368 0
3 1.0 0.899118 1467 1
4 0.0 0.021078 5517 0
In [83]:
confusion_matrix3 = metrics.confusion_matrix(y_train_pred_final.Converted,y_train_pred_final.Predicted)
confusion_matrix3
Out[83]:
array([[4428,  144],
       [1061, 1759]], dtype=int64)
In [84]:
# Let's check the overall accuracy.
print(metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.Predicted))
0.8369859307359307
Let's now check the VIFs again¶
In [85]:
vif = pd.DataFrame()
vif['Features'] = x_train[col].columns
vif['VIF'] = [variance_inflation_factor(x_train[col].values,i) for i in range(x_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF',ascending = False)
vif
Out[85]:
Features VIF
1 Lead Origin_Lead Add Form 1.49
11 Last Notable Activity_SMS Sent 1.43
7 Tags_Will revert after reading the email 1.36
2 Lead Source_Welingak Website 1.22
4 Tags_Closed by Horizzon 1.16
0 Do Not Email 1.12
6 Tags_Ringing 1.09
12 Last Notable Activity_Unsubscribed 1.06
3 Tags_Busy 1.03
9 Tags_switched off 1.03
8 Tags_invalid number 1.01
5 Tags_Lost to EINS 1.00
10 Last Notable Activity_Had a Phone Conversation 1.00
In [86]:
# Let's drop the column 'Tags_wrong number given' which has high p-value
col = col.drop('Tags_invalid number')
col
Out[86]:
Index(['Do Not Email', 'Lead Origin_Lead Add Form',
       'Lead Source_Welingak Website', 'Tags_Busy', 'Tags_Closed by Horizzon',
       'Tags_Lost to EINS', 'Tags_Ringing',
       'Tags_Will revert after reading the email', 'Tags_switched off',
       'Last Notable Activity_Had a Phone Conversation',
       'Last Notable Activity_SMS Sent', 'Last Notable Activity_Unsubscribed'],
      dtype='object')
In [87]:
x_train_sm = sm.add_constant(x_train[col])
logm5 = sm.GLM(y_train,x_train_sm, sm.families.Binomial())
res = logm5.fit()
res.summary()
Out[87]:
Generalized Linear Model Regression Results
Dep. Variable: Converted No. Observations: 7392
Model: GLM Df Residuals: 7379
Model Family: Binomial Df Model: 12
Link Function: Logit Scale: 1.0000
Method: IRLS Log-Likelihood: -2621.6
Date: Fri, 26 Sep 2025 Deviance: 5243.3
Time: 20:13:28 Pearson chi2: 1.26e+04
No. Iterations: 8 Pseudo R-squ. (CS): 0.4622
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
const -3.9547 0.176 -22.453 0.000 -4.300 -3.609
Do Not Email -1.5310 0.169 -9.052 0.000 -1.863 -1.199
Lead Origin_Lead Add Form 2.4335 0.211 11.533 0.000 2.020 2.847
Lead Source_Welingak Website 1.9628 1.035 1.897 0.058 -0.065 3.991
Tags_Busy 3.0538 0.270 11.316 0.000 2.525 3.583
Tags_Closed by Horizzon 8.3056 0.734 11.321 0.000 6.868 9.744
Tags_Lost to EINS 7.4861 0.539 13.889 0.000 6.430 8.543
Tags_Ringing -1.4207 0.276 -5.144 0.000 -1.962 -0.879
Tags_Will revert after reading the email 3.3740 0.178 18.988 0.000 3.026 3.722
Tags_switched off -2.2562 0.739 -3.053 0.002 -3.705 -0.808
Last Notable Activity_Had a Phone Conversation 3.7999 1.204 3.156 0.002 1.440 6.160
Last Notable Activity_SMS Sent 2.7491 0.099 27.823 0.000 2.555 2.943
Last Notable Activity_Unsubscribed 2.1073 0.506 4.168 0.000 1.116 3.098
In [88]:
y_train_pred = res.predict(x_train_sm).values.reshape(-1)
y_train_pred[:10]
Out[88]:
array([0.01540822, 0.35877541, 0.01880419, 0.89737471, 0.01880419,
       0.00460772, 0.35877541, 0.35877541, 0.97156911, 0.01880419])
In [89]:
y_train_pred_final['converted_prob'] = y_train_pred
In [90]:
# Creating new column 'predicted' with 1 if Churn_Prob > 0.5 else 0
y_train_pred_final['Predicted'] = y_train_pred_final.converted_prob.map(lambda x: 1 if x > 0.5 else 0)
y_train_pred_final.head()
Out[90]:
Converted converted_prob Prospect ID Predicted
0 0.0 0.015408 6487 0
1 0.0 0.358775 4759 0
2 0.0 0.018804 4368 0
3 1.0 0.897375 1467 1
4 0.0 0.018804 5517 0
In [91]:
confusion_matrix4 = metrics.confusion_matrix(y_train_pred_final.Converted,y_train_pred_final.Predicted)
confusion_matrix4
Out[91]:
array([[4435,  137],
       [1069, 1751]], dtype=int64)
In [92]:
# Let's check the overall accuracy.
print(metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.Predicted))
0.8368506493506493
In [93]:
vif = pd.DataFrame()
vif['Features'] = x_train[col].columns
vif['VIF'] = [variance_inflation_factor(x_train[col].values,i) for i in range(x_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF',ascending = False)
vif
Out[93]:
Features VIF
1 Lead Origin_Lead Add Form 1.49
10 Last Notable Activity_SMS Sent 1.43
7 Tags_Will revert after reading the email 1.36
2 Lead Source_Welingak Website 1.22
4 Tags_Closed by Horizzon 1.16
0 Do Not Email 1.12
6 Tags_Ringing 1.09
11 Last Notable Activity_Unsubscribed 1.06
3 Tags_Busy 1.03
8 Tags_switched off 1.03
5 Tags_Lost to EINS 1.00
9 Last Notable Activity_Had a Phone Conversation 1.00

All variables have a good value of VIF. So we need not drop any more variables and we can proceed with making predictions using this model only¶

Metrics beyond Accuracy¶

In [94]:
confusion_matrix5 = metrics.confusion_matrix(y_train_pred_final.Converted,y_train_pred_final.Predicted)
confusion_matrix5
Out[94]:
array([[4435,  137],
       [1069, 1751]], dtype=int64)
In [95]:
TP = confusion_matrix[1,1]  # True positive
FP = confusion_matrix[0,1]  # False positive
TN = confusion_matrix[0,0]  # True Negative
FN = confusion_matrix[1,0]  # False Negative
In [96]:
# Let's find out the sesitivity of the model
In [97]:
 TP/float(TP+FN)
Out[97]:
0.624113475177305
In [98]:
# let's findout the specificity
TN/float(TN+FP)
Out[98]:
0.968503937007874
In [99]:
# Calculate false postive rate - predicting churn when customer does not have churned
FP/float(FP+TN)
Out[99]:
0.031496062992125984
In [100]:
# positive predictive rate
TP/float(TP+FP)
Out[100]:
0.9243697478991597
In [101]:
# negative predictive rate
TN/float(TN+FN)
Out[101]:
0.8068513119533528

Step-9 : Plotting the ROC curve¶

In [102]:
def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return None
In [103]:
fpr, tpr, thresholds = metrics.roc_curve( y_train_pred_final.Converted, y_train_pred_final.converted_prob, drop_intermediate = False )
In [104]:
draw_roc(y_train_pred_final.Converted, y_train_pred_final.converted_prob)
No description has been provided for this image

Step-10: Finding the optimal cutoff point¶

In [105]:
# let's create columns with different cuttoff points

numbers = [float(x)/10 for x in range(10)]

for i in numbers:
    y_train_pred_final[i] = y_train_pred_final.converted_prob.map(lambda x : 1 if x>i else 0)

y_train_pred_final.head()
Out[105]:
Converted converted_prob Prospect ID Predicted 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
0 0.0 0.015408 6487 0 1 0 0 0 0 0 0 0 0 0
1 0.0 0.358775 4759 0 1 1 1 1 0 0 0 0 0 0
2 0.0 0.018804 4368 0 1 0 0 0 0 0 0 0 0 0
3 1.0 0.897375 1467 1 1 1 1 1 1 1 1 1 1 0
4 0.0 0.018804 5517 0 1 0 0 0 0 0 0 0 0 0
In [106]:
# Now lets calculate Accuracy, sensitivity and  Specificity for various cutoff probabalities

cutoff_df = pd.DataFrame(columns = ['prob','Accuracy','sensitivity','specificity'])

num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final.Converted,y_train_pred_final[i])
    total1=sum(sum(cm1))
    Accuracy = (cm1[0,0]+cm1[1,1])/total1
    specificity = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensitivity = cm1[1,1]/(cm1[1,0]+cm1[1,1])

    cutoff_df.loc[i] = [i, Accuracy, sensitivity, specificity]
print(cutoff_df)
                         
     prob  Accuracy  sensitivity  specificity
0.0   0.0  0.381494     1.000000     0.000000
0.1   0.1  0.693588     0.981915     0.515748
0.2   0.2  0.723079     0.976241     0.566929
0.3   0.3  0.725649     0.957447     0.582677
0.4   0.4  0.835363     0.624113     0.965661
0.5   0.5  0.836851     0.620922     0.970035
0.6   0.6  0.836580     0.618085     0.971347
0.7   0.7  0.834145     0.609220     0.972878
0.8   0.8  0.834010     0.608865     0.972878
0.9   0.9  0.705357     0.230851     0.998031
In [107]:
# let's plot the accuracy, specificity, sensitivity 
cutoff_df.plot.line(x='prob',y=['Accuracy','sensitivity','specificity'])
plt.show()
No description has been provided for this image

From the curve above, 0.4 is the optimum point to take it as a cutoff probability.¶

In [108]:
y_train_pred_final['final_predicted'] = y_train_pred_final.converted_prob.map( lambda x: 1 if x > 0.4 else 0)

y_train_pred_final.head()
Out[108]:
Converted converted_prob Prospect ID Predicted 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 final_predicted
0 0.0 0.015408 6487 0 1 0 0 0 0 0 0 0 0 0 0
1 0.0 0.358775 4759 0 1 1 1 1 0 0 0 0 0 0 0
2 0.0 0.018804 4368 0 1 0 0 0 0 0 0 0 0 0 0
3 1.0 0.897375 1467 1 1 1 1 1 1 1 1 1 1 0 1
4 0.0 0.018804 5517 0 1 0 0 0 0 0 0 0 0 0 0
In [109]:
# Accuracy score
metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)
Out[109]:
0.8353625541125541
In [110]:
# Confusion matrix
confusion_matrix2 = metrics.confusion_matrix(y_train_pred_final.Converted,y_train_pred_final.final_predicted)
confusion_matrix2
Out[110]:
array([[4415,  157],
       [1060, 1760]], dtype=int64)
In [111]:
TP = confusion_matrix2[1,1]
FP = confusion_matrix2[0,1]
TN = confusion_matrix2[0,0]
FN = confusion_matrix2[1,0]
In [112]:
# Sensitivity
TP/float(TP+FN)
Out[112]:
0.624113475177305
In [113]:
# specificity
TN/float(TN+FP)
Out[113]:
0.965660542432196
In [114]:
# False positive rate
FP/float(TN+FP)
Out[114]:
0.03433945756780402
In [115]:
# positive predictive rate
TP/float(TP+FP)
Out[115]:
0.9181011997913406
In [116]:
# negative predictive value
TN/float(TN+FN)
Out[116]:
0.806392694063927

Precision and Recall¶

In [117]:
Precision = TP/float(FP+TP)
Precision
Out[117]:
0.9181011997913406
In [118]:
Recall = TP/float(TP+FN)
Recall
Out[118]:
0.624113475177305
In [119]:
# Using sklearn utilities for the same
from sklearn.metrics import precision_score, recall_score, precision_recall_curve

precision = precision_score(y_train_pred_final.Converted,y_train_pred_final.final_predicted)
precision
Out[119]:
0.9181011997913406
In [120]:
Recall = recall_score(y_train_pred_final.Converted,y_train_pred_final.final_predicted)
Recall
Out[120]:
0.624113475177305

Precision and recall tradeoff¶

In [121]:
p, r, thresholds = precision_recall_curve(y_train_pred_final.Converted, y_train_pred_final.converted_prob)
In [122]:
plt.plot(thresholds, p[:-1], "g-")
plt.plot(thresholds, r[:-1], "r-")
plt.show()
No description has been provided for this image

Step - 11: Making predictions on the test data (Model Evaluation)¶

In [123]:
# Scaling the continous variables in the test dataset
x_test[['TotalVisits','Total Time Spent on Website','Page Views Per Visit']]= scaler.transform(x_test[['TotalVisits',
                                                                                                             'Total Time Spent on Website',
                                                                                                             'Page Views Per Visit']])
In [124]:
x_test = x_test[col]
x_test.head()
Out[124]:
Do Not Email Lead Origin_Lead Add Form Lead Source_Welingak Website Tags_Busy Tags_Closed by Horizzon Tags_Lost to EINS Tags_Ringing Tags_Will revert after reading the email Tags_switched off Last Notable Activity_Had a Phone Conversation Last Notable Activity_SMS Sent Last Notable Activity_Unsubscribed
4608 0 0 0 0 1 0 0 0 0 0 0 0
7935 1 0 0 0 0 0 0 1 0 0 0 0
4043 1 0 0 0 0 0 0 1 0 0 0 0
7821 0 0 0 0 0 0 1 0 0 0 0 0
856 0 0 0 0 0 0 0 1 0 0 0 0
In [125]:
# Add constant to the x_test variables
x_test_sm = sm.add_constant(x_test)
x_test_sm.shape
Out[125]:
(1848, 13)
In [126]:
# Making predictions on test data
y_test_pred = res.predict(x_test_sm)
y_test_pred
Out[126]:
4608    0.987269
7935    0.107967
4043    0.107967
7821    0.004608
856     0.358775
          ...   
7387    0.358775
3063    0.971569
603     0.004608
4210    0.358775
7352    0.004608
Length: 1848, dtype: float64
In [127]:
# Creating a dataframe "y_pred_1"
y_pred_1 = pd.DataFrame(y_test_pred)
In [128]:
# creating a dataframe "y_test_df"
y_test_df = pd.DataFrame(y_test)
In [129]:
# adding prospect id column
y_test_df['Prospect_ID'] = y_test_df.index
In [130]:
# Removing index for both dataframes to append them side by side 
y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)
In [131]:
# Appending y_test_df and y_pred_1 datafromes 
y_pred_final = pd.concat([y_test_df,y_pred_1],axis=1)
y_pred_final.head()
Out[131]:
Converted Prospect_ID 0
0 1.0 4608 0.987269
1 0.0 7935 0.107967
2 0.0 4043 0.107967
3 0.0 7821 0.004608
4 0.0 856 0.358775
In [132]:
# Renaming the column
y_pred_final = y_pred_final.rename(columns = {0: 'Converted_prob'})
y_pred_final
Out[132]:
Converted Prospect_ID Converted_prob
0 1.0 4608 0.987269
1 0.0 7935 0.107967
2 0.0 4043 0.107967
3 0.0 7821 0.004608
4 0.0 856 0.358775
... ... ... ...
1843 1.0 7387 0.358775
1844 1.0 3063 0.971569
1845 0.0 603 0.004608
1846 1.0 4210 0.358775
1847 0.0 7352 0.004608

1848 rows × 3 columns

In [133]:
# Rearranging the columns
y_pred_final = y_pred_final.reindex(['Prospect_ID','Converted','Converted_prob'],axis=1)
y_pred_final.head()
Out[133]:
Prospect_ID Converted Converted_prob
0 4608 1.0 0.987269
1 7935 0.0 0.107967
2 4043 0.0 0.107967
3 7821 0.0 0.004608
4 856 0.0 0.358775
In [134]:
# Mapping the converted prob to 0 or 1
y_pred_final['final_predicted'] = y_pred_final.Converted_prob.map(lambda x: 1 if x>0.38 else 0)
In [135]:
y_pred_final.head()
Out[135]:
Prospect_ID Converted Converted_prob final_predicted
0 4608 1.0 0.987269 1
1 7935 0.0 0.107967 0
2 4043 0.0 0.107967 0
3 7821 0.0 0.004608 0
4 856 0.0 0.358775 0
In [136]:
# calculating the lead scores (0-100)
y_pred_final['Lead score'] = (y_pred_final['Converted_prob']*100).round(0).astype(int)
y_pred_final
Out[136]:
Prospect_ID Converted Converted_prob final_predicted Lead score
0 4608 1.0 0.987269 1 99
1 7935 0.0 0.107967 0 11
2 4043 0.0 0.107967 0 11
3 7821 0.0 0.004608 0 0
4 856 0.0 0.358775 0 36
... ... ... ... ... ...
1843 7387 1.0 0.358775 0 36
1844 3063 1.0 0.971569 1 97
1845 603 0.0 0.004608 0 0
1846 4210 1.0 0.358775 0 36
1847 7352 0.0 0.004608 0 0

1848 rows × 5 columns

In [137]:
# Accuracy
metrics.accuracy_score(y_pred_final.Converted,y_pred_final.final_predicted)
Out[137]:
0.8284632034632035
In [138]:
confusion3 = metrics.confusion_matrix(y_pred_final.Converted,y_pred_final.final_predicted)
confusion3
Out[138]:
array([[1061,   46],
       [ 271,  470]], dtype=int64)
In [139]:
TP = confusion3[1,1] # true positive 
TN = confusion3[0,0] # true negatives
FP = confusion3[0,1] # false positives
FN = confusion3[1,0] # false negatives
In [140]:
# Let's see the sensitivity of our logistic regression model
TP / float(TP+FN)
Out[140]:
0.6342780026990553
In [141]:
# Let us calculate specificity
TN / float(TN+FP)
Out[141]:
0.9584462511291779
In [142]:
# Precision
Precision = TP/float(FP+TP)
Precision
Out[142]:
0.9108527131782945
In [143]:
# Recall
Recall = TP/float(TP+FN)
Recall
Out[143]:
0.6342780026990553

Generating lead scores (0-100)¶

In [144]:
# Tagging the users as "very hot leads",'warm leads", "cold leads" based on their lead score
y_pred_final['Interpretation'] = y_pred_final['Lead score'].map(lambda x: 'Very Hot Leads' if x>80 else 'Warm leads' if (50<x<79) else 'Cold Leads' )
y_pred_final
Out[144]:
Prospect_ID Converted Converted_prob final_predicted Lead score Interpretation
0 4608 1.0 0.987269 1 99 Very Hot Leads
1 7935 0.0 0.107967 0 11 Cold Leads
2 4043 0.0 0.107967 0 11 Cold Leads
3 7821 0.0 0.004608 0 0 Cold Leads
4 856 0.0 0.358775 0 36 Cold Leads
... ... ... ... ... ... ...
1843 7387 1.0 0.358775 0 36 Cold Leads
1844 3063 1.0 0.971569 1 97 Very Hot Leads
1845 603 0.0 0.004608 0 0 Cold Leads
1846 4210 1.0 0.358775 0 36 Cold Leads
1847 7352 0.0 0.004608 0 0 Cold Leads

1848 rows × 6 columns

Answering Business Questions¶

    1. Which are the top three variables in your model which contribute most towards the probability of a lead getting converted?
    1. What are the top 3 categorical/dummy variables in the model which should be focused the most on to increase the probability of lead conversion?
    1. X Education has a period of 2 months every year during which they hire some interns. The sales team has around 10 interns allotted to them. So, during this phase, they wish to make the lead conversion more aggressive. So, they want almost all the potential leads (i.e., the customers who have been predicted as 1 by the model) to be converted and hence, want to make phone calls to as much of such people as possible. Suggest a good strategy they should employ at this stage.
    1. Similarly, at times, the company reaches its target for a quarter before the deadline. During this time, the company wants the sales team to focus on some new work as well. So, during this time, the company’s aim is to not make phone calls unless it’s extremely necessary, i.e., they want to minimize the rate of useless phone calls. Suggest a strategy they should employ at this stage.

Question -1:¶

In [145]:
import pandas as pd
import numpy as np

# get feature names + coefficients directly from statsmodels
coef_df = pd.DataFrame({
    'Feature': res.params.index,
    'Coefficient': res.params.values
})
coef_df = coef_df[coef_df['Feature'] != 'const']
# add absolute coefficient for ranking
coef_df['Abs_Coefficient'] = np.abs(coef_df['Coefficient'])

# sort by absolute coefficient
coef_df = coef_df.sort_values(by='Abs_Coefficient', ascending=False)

# top 3 variables
top_3 = coef_df.head(3)
print(top_3)
                                           Feature  Coefficient  \
5                          Tags_Closed by Horizzon     8.305587   
6                                Tags_Lost to EINS     7.486129   
10  Last Notable Activity_Had a Phone Conversation     3.799912   

    Abs_Coefficient  
5          8.305587  
6          7.486129  
10         3.799912  

Question -2:¶

In [146]:
# Assuming 'res' is your fitted statsmodels Logit/GLM model
coefficients = res.params  

# Put into dataframe
coef_df = pd.DataFrame({
    'Feature': coefficients.index,
    'Coefficient': coefficients.values,
    'Abs_Coefficient': np.abs(coefficients.values)
})

# Keep only categorical (dummy) variables
dummy_vars = [col for col in coef_df['Feature'] if "_" in col]  # dummies usually have "_" in names
coef_df = coef_df[coef_df['Feature'].isin(dummy_vars)]

# Sort by absolute coefficient
coef_df = coef_df.sort_values(by='Abs_Coefficient', ascending=False)

# Get top 3
top3 = coef_df.head(3)
print(top3)
                                           Feature  Coefficient  \
5                          Tags_Closed by Horizzon     8.305587   
6                                Tags_Lost to EINS     7.486129   
10  Last Notable Activity_Had a Phone Conversation     3.799912   

    Abs_Coefficient  
5          8.305587  
6          7.486129  
10         3.799912  

Question -3:¶

Here, the goal of X Education during the 2-month intern period is very different from the normal business-as-usual mode:

  • Normal case: Sales team has limited capacity → need high precision (focus only on best, “hottest” leads).

  • Intern phase: More manpower (10 interns) → can afford to reach out to more leads → need high recall (don’t miss out on potential leads, even if some extra non-converters slip in).

By reducing the Decision Cutoff (Threshold) we can achieve this¶

In [147]:
# Reducing decision cuttoff to 0.3
y_train_pred_final['final_predicted'] = y_train_pred_final.converted_prob.map( lambda x: 1 if x > 0.3 else 0)

y_train_pred_final.head()
Out[147]:
Converted converted_prob Prospect ID Predicted 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 final_predicted
0 0.0 0.015408 6487 0 1 0 0 0 0 0 0 0 0 0 0
1 0.0 0.358775 4759 0 1 1 1 1 0 0 0 0 0 0 1
2 0.0 0.018804 4368 0 1 0 0 0 0 0 0 0 0 0 0
3 1.0 0.897375 1467 1 1 1 1 1 1 1 1 1 1 0 1
4 0.0 0.018804 5517 0 1 0 0 0 0 0 0 0 0 0 0
In [148]:
# Accuracy score
metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)
Out[148]:
0.7256493506493507
In [149]:
# confusion Matrix 
confusion_matrix2 = metrics.confusion_matrix(y_train_pred_final.Converted,y_train_pred_final.final_predicted)
In [150]:
TP = confusion_matrix2[1,1] # True positive
FP = confusion_matrix2[0,1] # False positive
TN = confusion_matrix2[0,0] # True negative
FN = confusion_matrix2[1,0] # False negative

Precision

In [151]:
Precision = TP/float(FP+TP)
Precision
Out[151]:
0.5859375

Recall

In [152]:
Recall = TP/float(TP+FN)
Recall
Out[152]:
0.9574468085106383

Question-4:¶

When the quarterly sales target is already met, the company should raise the cutoff probability (e.g., to 0.6–0.7) and focus only on the highest scoring leads (e.g., score 80–100). This will maximize precision and minimize wasted phone calls, while still ensuring that the very best opportunities are pursued.

In [153]:
# Increasing decision cuttoff to 0.3
y_train_pred_final['final_predicted'] = y_train_pred_final.converted_prob.map( lambda x: 1 if x > 0.7 else 0)

y_train_pred_final.head()
Out[153]:
Converted converted_prob Prospect ID Predicted 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 final_predicted
0 0.0 0.015408 6487 0 1 0 0 0 0 0 0 0 0 0 0
1 0.0 0.358775 4759 0 1 1 1 1 0 0 0 0 0 0 0
2 0.0 0.018804 4368 0 1 0 0 0 0 0 0 0 0 0 0
3 1.0 0.897375 1467 1 1 1 1 1 1 1 1 1 1 0 1
4 0.0 0.018804 5517 0 1 0 0 0 0 0 0 0 0 0 0
In [154]:
# Accuracy score
metrics.accuracy_score(y_train_pred_final.Converted, y_train_pred_final.final_predicted)
Out[154]:
0.8341450216450217
In [155]:
# confusion Matrix 
confusion_matrix3 = metrics.confusion_matrix(y_train_pred_final.Converted,y_train_pred_final.final_predicted)
In [156]:
TP = confusion_matrix3[1,1] # True positive
FP = confusion_matrix3[0,1] # False positive
TN = confusion_matrix3[0,0] # True negative
FN = confusion_matrix3[1,0] # False negative

Precision

In [157]:
Precision = TP/float(FP+TP)
Precision
Out[157]:
0.9326818675352877

Recall

In [158]:
Recall = TP/float(TP+FN)
Recall
Out[158]:
0.6092198581560284
In [ ]:
 
In [ ]: